Abstract

Entropy-based measures have been frequently used in symbolic sequence analysis. A symmetrized and smoothed form of
Kullback-Leibler divergence or relative entropy, the Jensen-Shannon divergence (JSD), is of particular interest because it
shares properties with families of other divergence measures and is interpretable in different domains including
statistical physics, information theory and mathematical statistics. The uniqueness and versatility of this measure arise
from a number of attributes, including generalization to any number of probability distributions and association of
weights with the distributions. Furthermore, its entropic formulation allows its generalization in different statistical
frameworks, such as non-extensive Tsallis statistics and higher-order Markovian statistics. We revisit these generalizations
and propose a new generalization of JSD in the integrated Tsallis and Markovian statistical framework. We show that this
generalization can be interpreted in terms of mutual information. We also investigate the performance of different JSD
generalizations in deconstructing chimeric DNA sequences assembled from bacterial genomes including those of E. coli, S.
enterica typhi, Y. pestis and H. influenzae. Our results show that the JSD generalizations bring in more pronounced
improvements when the sequences being compared are from phylogenetically proximal organisms, which are often difficult
to distinguish because of their compositional similarity. While small but noticeable improvements were observed with the
Tsallis statistical JSD generalization, relatively large improvements were observed with the Markovian generalization. The
proposed Tsallis-Markovian generalization, in turn, yielded more pronounced improvements relative to the Tsallis and
Markovian generalizations, specifically when the sequences being compared arose from phylogenetically proximal
organisms.



Citation: Re MA, Azad RK (2014) Generalization of Entropy Based Divergence Measures for Symbolic Sequence Analysis. PLoS ONE 9(4): e93532.
doi:10.1371/journal.pone.0093532

Editor: Kay Hamacher, Technical University Darmstadt, Germany 

Received September 17, 2013; Accepted March 4, 2014; Published April 11, 2014 

Copyright: © 2014 Re, Azad. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted 
use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Funding: This work was supported by grant UTI1655, UTN, PRC to M.A.R., and a faculty start-up fund and 2013 JPSRP award from the University of North Texas to 
R.K.A. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 

Competing interests: The authors have declared that no competing interests exist. 

* E-mail: Rajeev.Azad@unt.edu 



Introduction 

The statistical analysis of symbolic sequences is of great interest
in diverse fields, such as linguistics, image processing and biological
sequence analysis. Information-theoretic measures based on
Boltzmann-Gibbs-Shannon entropy (BGSE) have been frequently
used for interpreting discrete, symbolic data [1]. Using
information-theoretic functionals makes it unnecessary to map the
symbolic sequence to a numeric sequence. Given a random
variable $X$ with $k$ possible values $x_i$, $i = 1, 2, \ldots, k$, the BGSE of the
probability distribution $\mathbf{p}$ of $X$ is defined as

$$H_1[\mathbf{p}] = -\sum_{i=1}^{k} p(x_i)\,\ln p(x_i). \tag{1}$$

BGSE has an additivity property. Let $X$ and $Y$ be two
statistically independent variables and $\mathbf{p}_X$ and $\mathbf{p}_Y$ be their
corresponding probability distributions, so that their joint
probability distribution is the product of their marginal distributions,
$\mathbf{p}_{XY} = \mathbf{p}_X \mathbf{p}_Y$. Then,

$$H_1[\mathbf{p}_X \mathbf{p}_Y] = H_1[\mathbf{p}_X] + H_1[\mathbf{p}_Y]. \tag{2}$$

The central role played by BGSE in information theory has
encouraged proposals for generalizing this function.
Outstanding in the realm of statistical physics has been the Tsallis
generalization of BGSE [2,3], which was obtained by substituting
the natural logarithm with its deformed expression [4],

$$H_q[\mathbf{p}] = -\sum_{i=1}^{k} p(x_i)^q\,\ln_q p(x_i), \tag{3}$$

with the deformed logarithm defined as

$$\ln_q(x) = \frac{x^{1-q} - 1}{1-q},$$

where $q$ is a real number; in the limit $q \to 1$, $\ln_q \to \ln$ and BGSE
is recovered. The index $q$ gives a measure of the non-extensivity of the
generalization, as expressed by the pseudo-additivity rule [2,3]:

$$H_q[\mathbf{p}_X \mathbf{p}_Y] = H_q[\mathbf{p}_X] + H_q[\mathbf{p}_Y] + (1-q)\,H_q[\mathbf{p}_X]\,H_q[\mathbf{p}_Y]. \tag{4}$$

In the limit $q \to 1$, the BGSE additivity of Eqn. 2 is recovered.
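The definitions in Eqns. 1-4 can be checked numerically. The following is a minimal sketch, assuming small illustrative distributions that are not taken from the paper; the function names (`ln_q`, `tsallis_entropy`, `product_distribution`) are hypothetical helpers introduced only for this example.

```python
import math

def ln_q(x, q):
    """Deformed logarithm ln_q(x) = (x**(1-q) - 1) / (1-q); natural log at q = 1."""
    if q == 1:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """H_q[p] = -sum_i p_i**q * ln_q(p_i) (Eqn. 3); zero-probability terms are skipped."""
    return -sum(pi ** q * ln_q(pi, q) for pi in p if pi > 0)

def product_distribution(px, py):
    """Joint distribution of two statistically independent variables."""
    return [a * b for a in px for b in py]

# Illustrative distributions (assumed values, not from the paper).
px = [0.5, 0.3, 0.2]
py = [0.6, 0.4]
q = 2.0

h_joint = tsallis_entropy(product_distribution(px, py), q)
hx, hy = tsallis_entropy(px, q), tsallis_entropy(py, q)
pseudo_additive = hx + hy + (1.0 - q) * hx * hy  # right-hand side of Eqn. 4

# Close to q = 1 the Tsallis entropy approaches BGSE (Eqn. 1).
near_limit = tsallis_entropy(px, 1.0 + 1e-6)
bgse = tsallis_entropy(px, 1)
```

For $q = 2$ the entropy reduces to $1 - \sum_i p_i^2$, so the two sides of Eqn. 4 can also be verified by hand for these values.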

Measures based on BGSE have been proposed for measuring
the difference between probability distributions. These include the
Kullback-Leibler divergence and its symmetrized forms [5]. Lin
introduced the Jensen-Shannon divergence (JSD) as a generalization
of a symmetrized version of Kullback-Leibler divergence,
assigning weights to the probability distributions involved according
to their relative importance [5]. Subsequently, different
generalizations of JSD were proposed, either within the framework
of Tsallis statistics [6] or within the Markovian statistical framework
[7]. While the former exploits the non-extensivity implicit in the
Tsallis generalization of BGSE, the latter is based on conditional
entropy, which facilitates exploiting higher order correlations within
symbolic sequences. Since the latter was obtained within the
framework of Markov chain models, this generalization was
named Markovian Jensen-Shannon divergence (MJSD) and was
shown to significantly outperform standard JSD in its application
to deciphering genomic heterogeneities [7,8].

Because of the importance and usefulness of JSD in
different disciplines, significant advances have been made in
the generalization and interpretation of this measure. Yet a
comprehensive treatise on generalization as well as comparative
assessment of the generalized measures has remained
elusive. Here, we have attempted to bridge the gaps by
providing the missing details. Furthermore, we present here a
non-extensive generalization of MJSD within the Tsallis
statistical framework. The flexibility afforded by the integrated
Tsallis-Markovian generalization opens new opportunities
for (re-)visiting and exploring the symbolic sequence data
prevalent in different domains. In section 1, we
summarize the standard JSD, its properties and its interpretation
in different contexts. This is leveraged to demonstrate
in the subsequent sections that certain interpretations are readily
amenable to different generalizations of JSD including the
proposed Tsallis-Markovian generalization. In section 2, we
describe the non-extensive JSD generalization, followed by the
conditional dependence based or Markovian generalization in
section 3. In section 4, we propose a non-extensive
generalization of the Markovian generalization of JSD. Finally, in
the Experiments and Assessment section, we present a comparative
assessment of the generalized measures in deconstructing chimeric
DNA sequence constructs. Note also that in the following sections, for
the sake of simplicity, we obtain the generalizations of
JSD for two probability distributions or symbolic sequences.
The generalization to any number of distributions or
sequences is straightforward (as with the standard JSD, Eqn.
9 in section 1).

Theory and Methods 

1. The Jensen-Shannon Divergence Measure 

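Consider a discrete random variable $X$ (with $k$ possible values)
and two probability distributions for $X$, $\mathbf{p}_1$ and $\mathbf{p}_2$. The
Kullback-Leibler information gain or Kullback-Leibler divergence (KLD) is
defined as [1],

$$K_1[\mathbf{p}_1,\mathbf{p}_2] = \sum_{i=1}^{k} p_1(x_i)\,\ln\frac{p_1(x_i)}{p_2(x_i)}. \tag{5}$$

KLD is not symmetric and requires absolute continuity ($p_1(x_i) = 0$
whenever $p_2(x_i) = 0$). To overcome these shortcomings, Lin [5]
introduced a symmetrized generalization of KLD, namely, the
L-divergence, defined as

$$L_1[\mathbf{p}_1,\mathbf{p}_2] = K_1\!\left[\mathbf{p}_1,\frac{\mathbf{p}_1+\mathbf{p}_2}{2}\right] + K_1\!\left[\mathbf{p}_2,\frac{\mathbf{p}_1+\mathbf{p}_2}{2}\right], \tag{6}$$

which can be expressed in an entropic form, i.e.

$$L_1[\mathbf{p}_1,\mathbf{p}_2] = 2H_1\!\left[\frac{\mathbf{p}_1+\mathbf{p}_2}{2}\right] - H_1[\mathbf{p}_1] - H_1[\mathbf{p}_2]. \tag{7}$$

The generalization of the L-divergence is straightforward and is defined
as the Jensen-Shannon divergence,

$$D_1[\mathbf{p}_1,\mathbf{p}_2] = H_1[\pi_1\mathbf{p}_1 + \pi_2\mathbf{p}_2] - \pi_1 H_1[\mathbf{p}_1] - \pi_2 H_1[\mathbf{p}_2], \tag{8}$$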

where $H_1[\cdot]$ is BGSE (Eqn. 1). The weights $\pi_i$ associated with the
probability distributions $\mathbf{p}_i$ allow assigning differential importance
to each probability distribution. JSD does not require absolute
continuity of probability distributions with respect to each other.
Furthermore, JSD can be readily extended to include more than
two probability distributions,

$$D_1[\mathbf{p}_1,\ldots,\mathbf{p}_n] = H_1\!\left[\sum_{i=1}^{n}\pi_i\mathbf{p}_i\right] - \sum_{i=1}^{n}\pi_i H_1[\mathbf{p}_i], \tag{9}$$

given $n$ probability distributions.
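Eqn. 9 translates almost line by line into code. The sketch below is illustrative only: the distributions and weights are assumed values, and `shannon_entropy` and `jsd` are hypothetical helper names, not part of any published implementation.

```python
import math

def shannon_entropy(p):
    """BGSE (Eqn. 1) with natural logarithm; zero-probability terms are skipped."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def jsd(distributions, weights):
    """Weighted JSD of n distributions (Eqn. 9): entropy of the mixture
    minus the weighted average of the individual entropies."""
    k = len(distributions[0])
    mixture = [sum(w * p[i] for p, w in zip(distributions, weights)) for i in range(k)]
    weighted = sum(w * shannon_entropy(p) for p, w in zip(distributions, weights))
    return shannon_entropy(mixture) - weighted

# Illustrative inputs (assumed values).
identical = jsd([[0.2, 0.8], [0.2, 0.8]], [0.5, 0.5])
disjoint = jsd([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
three_way = jsd([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]],
                [0.5, 0.25, 0.25])
```

With natural logarithms, two equally weighted distributions with disjoint support give $\ln 2$; the same quantity equals 1 when logarithms are taken to base 2.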

Because BGSE is a concave function, JSD is non-negative,
$D_1[\mathbf{p}_1,\ldots,\mathbf{p}_n] \geq 0$, as can be verified from Jensen's
inequality. In addition to non-negativity and symmetry, JSD
also has a lower and an upper bound, $0 \leq \mathrm{JSD} \leq 1$, and has been
shown to be the square of a metric [6,7,9,10]. Because of these
interesting properties, this measure has been successfully applied to
solving a variety of problems arising from different fields including
molecular biology (e.g. DNA sequence analysis) [9,11-17],
condensed matter physics [18], atomic and molecular physics
[19], and engineering (e.g. edge detection in digital imaging) [20].

Grosse et al. gave three intuitive interpretations of JSD in the
framework of statistical physics, information theory and
mathematical statistics [9]. Since we intend to show in the later sections
that some of these interpretations could be readily extended to the
generalized JSD measures, we briefly describe below the three
interpretations of JSD.

Interpretation A (IA): Framework of statistical
physics. In the framework of statistical physics, JSD can be
interpreted as the intensive entropy of mixing. Considering two
vessels with a mixture of ideal gases, the mixing entropy is
obtained as

$$H_{\mathrm{mix}} = -k_B\left[N\sum_{i} f_i \ln f_i - \sum_{s} n^{(s)} \sum_{i} f_i^{(s)} \ln f_i^{(s)}\right], \tag{10}$$

where $k_B$ is the Boltzmann constant, $s$ labels the vessels,
$n^{(s)}$ denotes the number of gas particles in vessel $s$,
$N = \sum_s n^{(s)}$ denotes the total number of ideal gas particles,
$\mathbf{f}^{(s)}$ denotes the vector of molar fractions of the gases in vessel $s$, and
$\mathbf{f} = \sum_s (n^{(s)}/N)\,\mathbf{f}^{(s)}$ denotes the vector of molar fractions of all
gases in the mixture. Under this interpretation,

$$D_1 = \frac{H_{\mathrm{mix}}}{N k_B}, \tag{11}$$

identifying $\pi_s = n^{(s)}/N$. Given a set of subsequences labeled by $s$, $D_1$ could thus be
interpreted as the overall difference between the entropy of the
total sequence and the weighted average of the entropies of the
subsequences (each subsequence represented by a probability
distribution, see Eqn. 9).

Interpretation B (IB): Framework of information
theory. In the framework of information theory, $D_1$ can be
interpreted as the mutual information. Consider two subsequences
$S_1$, $S_2$ of length $n_1$ and $n_2$ symbols respectively, derived from an
alphabet $A = \{e_1, \ldots, e_k\}$ of $k$ symbols. The mutual information of
symbols and the subsequences they belong to (denoted $E$ and $S$
respectively, representing all symbols and all subsequences) is given
as

$$I_1(E;S) = \sum_{i=1}^{k}\sum_{j=1}^{2} p(e_i,S_j)\,\ln\frac{p(e_i,S_j)}{p(e_i)\,\pi(S_j)} = H_1[\mathbf{p}] - H_1[\mathbf{p}|\boldsymbol{\pi}], \tag{12}$$

which is the reduction in the uncertainty of $E$ due to the
knowledge of $S$. Here, $p(e_i,S_j)$ is the joint probability of variables $e_i$
and $S_j$. The marginal probabilities $\pi(S_j)$ and $p(e_i)$ are defined as

$$\pi(S_j) = \sum_{i=1}^{k} p(e_i,S_j) = \frac{n(S_j)}{N}, \qquad p(e_i) = \sum_{j=1}^{2} p(e_i,S_j), \tag{13}$$

and the conditional entropy $H_1[\mathbf{p}|\boldsymbol{\pi}]$ is defined as

$$H_1[\mathbf{p}|\boldsymbol{\pi}] = -\sum_{j=1}^{2}\pi(S_j)\sum_{i=1}^{k} p_{S_j}(e_i)\,\ln p_{S_j}(e_i), \tag{14}$$

where the conditional probability $p_{S_j}(e_i) = p(e_i,S_j)/\pi(S_j)$ is
the probability of finding symbol $e_i$ in the given subsequence $S_j$.
Mutual information can be rewritten as

$$I_1(E;S) = \sum_{i=1}^{k}\sum_{j=1}^{2}\pi(S_j)\,p_{S_j}(e_i)\,\ln\frac{p_{S_j}(e_i)}{p(e_i)}. \tag{15}$$

Recognizing $p(e_i) = \pi(S_1)\,p_{S_1}(e_i) + \pi(S_2)\,p_{S_2}(e_i)$ in this last
expression, we re-obtain Eqn. 8.
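The equivalence between the mutual information of Eqn. 12 and the length-weighted JSD of Eqn. 8 can be verified on toy data. The sketch below uses two short made-up DNA strings (assumed inputs, not data from the paper) and hypothetical helper names.

```python
import math
from collections import Counter

def entropy(p):
    """BGSE with natural logarithm; zero-probability terms are skipped."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jsd_length_weighted(s1, s2, alphabet):
    """Eqn. 8 with weights proportional to subsequence length."""
    n1, n2 = len(s1), len(s2)
    pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)
    c1, c2 = Counter(s1), Counter(s2)
    p1 = [c1[a] / n1 for a in alphabet]
    p2 = [c2[a] / n2 for a in alphabet]
    mix = [pi1 * a + pi2 * b for a, b in zip(p1, p2)]
    return entropy(mix) - pi1 * entropy(p1) - pi2 * entropy(p2)

def mutual_information(s1, s2):
    """Eqn. 12: mutual information between symbol identity and subsequence label."""
    total = len(s1) + len(s2)
    symbol_counts = Counter(s1) + Counter(s2)
    mi = 0.0
    for seq in (s1, s2):
        pi_j = len(seq) / total                 # pi(S_j) = n(S_j) / N
        for symbol, n in Counter(seq).items():
            joint = n / total                   # p(e_i, S_j)
            marginal = symbol_counts[symbol] / total  # p(e_i)
            mi += joint * math.log(joint / (marginal * pi_j))
    return mi

# Toy subsequences (assumed example data).
s1, s2 = "AACGTTAACC", "GGGTCGGTAGGCGG"
d = jsd_length_weighted(s1, s2, "ACGT")
i = mutual_information(s1, s2)
```

Both quantities agree to floating-point precision, which is the content of interpretation IB when relative frequencies are used as probability estimates.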

Interpretation C (IC): Framework of mathematical
statistics. In the framework of mathematical statistics, $D_1$ can
be interpreted as the log-likelihood ratio. Consider the sequence $S$
composed of $N$ symbols as in IB, but we now ask for the probability
distribution $\mathbf{p}$ that maximizes the likelihood of $S$. The maximum
likelihood principle suggests

$$\ln L_{\max} = -N\,H_1[\mathbf{f}], \tag{16}$$

with $f(e_i) = N(e_i)/\sum_i N(e_i)$, i.e. the relative frequency of symbol
$e_i$ in the sequence $S$. The probability distribution that maximizes
the likelihood is $\mathbf{p} = \mathbf{f}$. A similar calculation can be carried out for
the likelihood of the subsequences $S_j$ composing the sequence $S$. Under
this interpretation, we have

$$D_1 = \frac{\Delta L}{N}, \qquad \Delta L = \ln\frac{\prod_{j=1}^{2} L_{\max}^{(j)}}{L_{\max}}, \tag{17}$$

where $L_{\max}^{(j)}$ is the maximum likelihood of subsequence $S_j$.
Here, $\Delta L$ is the log-likelihood ratio, which gives a measure of the
increase in the log-likelihood when sequence $S$ is modeled as a
concatenation of two subsequences.
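The log-likelihood reading of JSD can be illustrated in a few lines. This is a sketch under the assumption that each (sub)sequence is modeled as i.i.d. with probabilities set to its own relative frequencies; the strings and function names are hypothetical.

```python
import math
from collections import Counter

def max_log_likelihood(seq):
    """ln L_max = sum_a N(a) * ln(N(a)/N) = -N * H[f] (Eqn. 16)."""
    n = len(seq)
    return sum(c * math.log(c / n) for c in Counter(seq).values())

def entropy_of(seq):
    """BGSE of the relative symbol frequencies of seq."""
    n = len(seq)
    return -sum((c / n) * math.log(c / n) for c in Counter(seq).values())

def log_likelihood_ratio(s1, s2):
    """Gain in maximum log-likelihood when s1 + s2 is modeled as two segments."""
    return (max_log_likelihood(s1) + max_log_likelihood(s2)
            - max_log_likelihood(s1 + s2))

# Toy subsequences (assumed example data).
s1, s2 = "AACGTTAACC", "GGGTCGGTAGGCGG"
n = len(s1) + len(s2)
delta_l = log_likelihood_ratio(s1, s2)
jsd_value = (entropy_of(s1 + s2)
             - len(s1) / n * entropy_of(s1)
             - len(s2) / n * entropy_of(s2))
```

Dividing `delta_l` by the total length `n` reproduces the length-weighted JSD, as interpretation IC states.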

2. Non-extensive Generalization of JSD 

Several forms of generalization in terms of non-extensive
entropy (Eqn. 3), introduced by Tsallis in modeling physical
systems with long range interactions [3], have been suggested. The
different JSD generalizations found in the literature can be
interpreted under the schema presented in the previous section as
IA or IB. A key concept in these generalizations is that of the mutual
information measure.

Burbea and Rao [21] defined a generalized mutual information
measure via entropy substitution, which may be interpreted as in
IA. The generalized JSD can be obtained by merely substituting
$H_1$ by $H_q$ in Eqn. 8:

$$D_q^{IA}[\mathbf{p}_1,\mathbf{p}_2] = H_q[\pi_1\mathbf{p}_1 + \pi_2\mathbf{p}_2] - \pi_1 H_q[\mathbf{p}_1] - \pi_2 H_q[\mathbf{p}_2]. \tag{18}$$

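An alternative generalization was obtained by Lamberti and
Majtey [6] via the non-extensive generalization of KL divergence
proposed by Tsallis [22]:

$$K_q[\mathbf{p}_1,\mathbf{p}_2] = -\sum_{i} p_1(e_i)\,\ln_q\frac{p_2(e_i)}{p_1(e_i)}. \tag{19}$$

The symmetrized L-divergence, in the framework of Tsallis
statistics, was obtained as

$$L_q[\mathbf{p}_1,\mathbf{p}_2] = K_q\!\left[\mathbf{p}_1,\frac{\mathbf{p}_1+\mathbf{p}_2}{2}\right] + K_q\!\left[\mathbf{p}_2,\frac{\mathbf{p}_1+\mathbf{p}_2}{2}\right]. \tag{20}$$

The $L_q$-divergence was shown to generalize to the $JS_q$-divergence,
replacing equal weights for the two distributions with arbitrary
weights $\pi_1$ and $\pi_2$ associated with $\mathbf{p}_1$ and $\mathbf{p}_2$. However, this
generalization does not assume a full entropic form as $D_q^{IA}$ does [6]:

$$D_q^{IB}[\mathbf{p}_1,\mathbf{p}_2] = -\sum_{i}\left[\pi_1 p_1^q(e_i) + \pi_2 p_2^q(e_i)\right]\ln_q\!\left[\pi_1 p_1(e_i) + \pi_2 p_2(e_i)\right] - \pi_1 H_q[\mathbf{p}_1] - \pi_2 H_q[\mathbf{p}_2]. \tag{21}$$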

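Jensen's inequality allows one to show that $D_q^{IB}[\mathbf{p}_1,\mathbf{p}_2] \geq D_q^{IA}[\mathbf{p}_1,\mathbf{p}_2]$ for $q > 1$, with the inequality reversed for $0 < q < 1$.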
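We have put the superscript IB in the former as this generalization
has an interpretation in mutual information. $D_q^{IB}[\mathbf{p}_1,\mathbf{p}_2]$ can be
rewritten as

$$D_q^{IB}[\mathbf{p}_1,\mathbf{p}_2] = -\sum_{j=1}^{2}\pi_j\sum_{i=1}^{k} p_j^q(e_i)\left[\ln_q p(e_i) - \ln_q p_j(e_i)\right] = -\sum_{j=1}^{2}\sum_{i=1}^{k}\pi_j\,p_j(e_i)\,\ln_q\frac{p(e_i)}{p_j(e_i)}, \tag{22}$$

where $p(e_i) = \pi_1 p_1(e_i) + \pi_2 p_2(e_i)$.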
Figure 1. Error (in base pairs) in detecting the join point in the chimeric sequence constructs for E. coli ⊕ S. enterica, E. coli ⊕ Y. pestis,
and E. coli ⊕ H. influenzae (⊕ denotes concatenation). The proposed Tsallis-Markovian generalization of the Jensen-Shannon divergence measure
was used to obtain the mean and standard deviation of the error from 10,000 replicates for each type of chimeric sequence construct. The error in
localizing the join point was obtained as the absolute difference between the position where the divergence was maximized and the position of the
join point (at 10 Kbp) in a chimeric sequence construct of size 20 Kbp. Error statistics for the two special cases of the proposed generalized measure are
shown within rectangular boxes: the Markovian generalization (q = 1) in a dashed green border box and the Tsallis non-extensive generalization (model
order = 0) in dashed red border boxes. The minimum values of mean and standard deviation of the error for each chimeric construct type are shown
encircled and bold faced.
doi:10.1371/journal.pone.0093532.g001



Expression (22) can be interpreted as mutual information in Tsallis
non-extensive statistics, being a generalization of Eqn. (15):

$$I_q(E;S) = -\sum_{i}\sum_{j} p(e_i,S_j)\,\ln_q\frac{p(e_i)\,\pi(S_j)}{p(e_i,S_j)}. \tag{23}$$

As noted in [22], $I_q(E;S)$ gives a measure of the independence
of two random variables: $I_q(E;S) = 0$ for independent
variables. In this case of statistically independent variables,
the probability distribution of symbols $e_i$ is the same for both
sequence segments. Here, $S$ is interpreted as a random variable
with probability distribution given by the weights $\pi_j$.
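The two non-extensive generalizations discussed in this section, the entropic form (Eqn. 18) and the mutual-information form (Eqn. 22), can be compared numerically. The sketch below assumes $q > 0$ and uses made-up distributions; all function names are hypothetical.

```python
import math

def ln_q(x, q):
    """Deformed logarithm; natural log at q = 1."""
    if q == 1:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """H_q[p] = -sum_i p_i**q * ln_q(p_i) (Eqn. 3)."""
    return -sum(pi ** q * ln_q(pi, q) for pi in p if pi > 0)

def jsd_q_entropic(p1, p2, w1, w2, q):
    """Entropy-substitution form (Eqn. 18)."""
    mix = [w1 * a + w2 * b for a, b in zip(p1, p2)]
    return (tsallis_entropy(mix, q)
            - w1 * tsallis_entropy(p1, q)
            - w2 * tsallis_entropy(p2, q))

def jsd_q_mutual_information(p1, p2, w1, w2, q):
    """Mutual-information form (Eqn. 22); terms with p_j(e_i) = 0 vanish for q > 0."""
    mix = [w1 * a + w2 * b for a, b in zip(p1, p2)]
    total = 0.0
    for w, p in ((w1, p1), (w2, p2)):
        for pj, m in zip(p, mix):
            if pj > 0:
                total -= w * pj * ln_q(m / pj, q)
    return total

# Illustrative distributions (assumed values).
p1, p2 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
ia_1 = jsd_q_entropic(p1, p2, 0.5, 0.5, 1)
ib_1 = jsd_q_mutual_information(p1, p2, 0.5, 0.5, 1)
ia_2 = jsd_q_entropic(p1, p2, 0.5, 0.5, 2.0)
ib_2 = jsd_q_mutual_information(p1, p2, 0.5, 0.5, 2.0)
```

At $q = 1$ the two forms coincide with the standard JSD; at $q = 2$ they differ for these inputs, which illustrates that the mutual-information form is not simply the entropic form with $H_q$ substituted.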

3. Markov Model Generalization of JSD 

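The standard JSD measure assumes each symbol in a
sequence to occur independently of the others. In order to
account for short range interdependence between symbols,
JSD can be generalized by means of conditional entropy. This
generalization can be obtained in the framework of a Markov
chain model of order $m$, where the occurrence of a symbol is
dependent on the $m$ preceding symbols in the sequence. The
JSD corresponding to Markov sources can be obtained
following the steps in the derivation of JSD (Eqns. 6-8) for
the independent and identically-distributed (i.i.d.) sources. For
example, for a Markov source of order $m$, where the
occurrence of symbol $e_i$ depends on its immediately preceding context
$w$ of length $m$,

$$\begin{aligned}
D_1^m[\mathbf{p}_1,\mathbf{p}_2] ={}& \pi_1\sum_{w} p_1(w)\sum_{i} p_1(e_i|w)\,\log\frac{p_1(e_i|w)}{\dfrac{\pi_1 p_1(w)}{\pi_1 p_1(w)+\pi_2 p_2(w)}\,p_1(e_i|w) + \dfrac{\pi_2 p_2(w)}{\pi_1 p_1(w)+\pi_2 p_2(w)}\,p_2(e_i|w)} \\
&+ \pi_2\sum_{w} p_2(w)\sum_{i} p_2(e_i|w)\,\log\frac{p_2(e_i|w)}{\dfrac{\pi_1 p_1(w)}{\pi_1 p_1(w)+\pi_2 p_2(w)}\,p_1(e_i|w) + \dfrac{\pi_2 p_2(w)}{\pi_1 p_1(w)+\pi_2 p_2(w)}\,p_2(e_i|w)},
\end{aligned} \tag{24}$$

which leads to, after rearranging,

$$\begin{aligned}
D_1^m[\mathbf{p}_1,\mathbf{p}_2] ={}& -\sum_{w}\sum_{i}\left[\pi_1 p_1(w)p_1(e_i|w) + \pi_2 p_2(w)p_2(e_i|w)\right]\log\frac{\pi_1 p_1(w)p_1(e_i|w) + \pi_2 p_2(w)p_2(e_i|w)}{\pi_1 p_1(w) + \pi_2 p_2(w)} \\
& - \pi_1 H_1^m[\mathbf{p}_1] - \pi_2 H_1^m[\mathbf{p}_2] \\
={}& -\sum_{w}\left[\pi_1 p_1(w) + \pi_2 p_2(w)\right]\sum_{i}\frac{\pi_1 p_1(w)p_1(e_i|w) + \pi_2 p_2(w)p_2(e_i|w)}{\pi_1 p_1(w) + \pi_2 p_2(w)}\,\log\frac{\pi_1 p_1(w)p_1(e_i|w) + \pi_2 p_2(w)p_2(e_i|w)}{\pi_1 p_1(w) + \pi_2 p_2(w)} \\
& - \pi_1 H_1^m[\mathbf{p}_1] - \pi_2 H_1^m[\mathbf{p}_2].
\end{aligned} \tag{25}$$

Therefore,

$$D_1^m[\mathbf{p}_1,\mathbf{p}_2] = H_1^m[\pi_1\mathbf{p}_1 + \pi_2\mathbf{p}_2] - \pi_1 H_1^m[\mathbf{p}_1] - \pi_2 H_1^m[\mathbf{p}_2]. \tag{26}$$

Here $H_1^m[\cdot]$ corresponds to the entropy function for Markov sources
of order $m$,

$$H_1^m[\mathbf{p}] = -\sum_{w} p(w)\sum_{i} p(e_i|w)\,\log p(e_i|w). \tag{27}$$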

In contrast to Lamberti and Majtey's generalization within the
Tsallis non-extensive statistical framework [6] (Eqn. 21), this
generalization takes the full entropic form. Thakur et al.
introduced "Markov models for genomic segmentation" (MMS)
[7], where they replaced the BGSE with Markovian entropy (Eqn.
27) in the expression of JSD (Eqn. 8), which is amenable to
interpretation IA. They also derived this generalization, which we
call Markovian JSD (MJSD), introduced earlier in [8], using the
likelihood function (interpretation IC).

Figure 2. Mean values of non-extensive MJSD at each position of the chimeric sequence constructs E. coli ⊕ Y. pestis, for the
parameter setting at which the non-extensive MJSD achieved most pronounced error reduction (q = 2, order 3). The chimeric
constructs of size 20 Kbp are comprised of two equal sized sequences, with each component sequence of length 10 Kbp obtained from the genome
of each organism.
doi:10.1371/journal.pone.0093532.g002

This generalization could also be interpreted in terms of
conditional mutual information, consistent with interpretation IB
(Eqn. 15),

$$I_1^m(E;S|W) = \sum_{w}\sum_{i}\sum_{j} p(e_i,S_j,w)\,\ln\frac{p(e_i,S_j|w)}{p(e_i|w)\,\pi(S_j|w)}. \tag{28}$$

Making use of the conditional entropy definition and after some
algebraic manipulation, one can identify $D_1^m = I_1^m$ according to
interpretation IB.
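A minimal sketch of the Markovian JSD of Eqn. 26 follows, assuming that context and symbol probabilities are estimated from overlapping $(m+1)$-mer counts and that weights are proportional to the number of counted positions. These estimation choices and all names are assumptions made for illustration, not the MMS implementation of [7].

```python
import math
from collections import Counter, defaultdict

def joint_distribution(seq, m):
    """Estimate p(w, e) from counts of context w (length m) followed by symbol e."""
    counts = Counter((seq[i - m:i], seq[i]) for i in range(m, len(seq)))
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}, total

def markov_entropy(joint):
    """H^m[p] = -sum_{w,e} p(w,e) * ln p(e|w), equivalent to Eqn. 27."""
    ctx = defaultdict(float)
    for (w, _), p in joint.items():
        ctx[w] += p                      # marginal p(w)
    return -sum(p * math.log(p / ctx[w]) for (w, _), p in joint.items() if p > 0)

def mjsd(s1, s2, m):
    """Markovian JSD (Eqn. 26) between two sequences at model order m."""
    p1, n1 = joint_distribution(s1, m)
    p2, n2 = joint_distribution(s2, m)
    pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)
    keys = set(p1) | set(p2)
    mix = {k: pi1 * p1.get(k, 0.0) + pi2 * p2.get(k, 0.0) for k in keys}
    return markov_entropy(mix) - pi1 * markov_entropy(p1) - pi2 * markov_entropy(p2)

# Two toy sequences with identical composition but different symbol ordering.
s1, s2 = "ACGT" * 30, "AGCT" * 30
order0 = mjsd(s1, s2, 0)   # composition only: the sequences look identical
order1 = mjsd(s1, s2, 1)   # first-order context reveals the difference
```

The toy sequences have the same nucleotide composition, so the order-0 measure is zero, while the order-1 measure is clearly positive; this is the kind of short-range correlation the Markovian generalization is designed to capture.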

4. Non-extensive Markovian JSD Generalization 

We obtain the generalization of MJSD within the framework
of Tsallis non-extensive statistics. This integrates two different
generalizations of JSD, the Markovian and the Tsallis
generalization, thus yielding a generalization of which many of
the previously described JSD generalizations are special cases.

Figure 3. Frequency distribution of position with maximum value of non-extensive MJSD for the chimeric sequence constructs E.
coli ⊕ Y. pestis, for the parameter setting at which the non-extensive MJSD achieved most pronounced error reduction (q = 2, order
3). The chimeric constructs of size 20 Kbp are comprised of two equal sized sequences, with each component sequence of length 10 Kbp obtained
from the genome of each organism.
doi:10.1371/journal.pone.0093532.g003

The non-extensive conditional or Markovian Kullback-Leibler
divergence between two distributions $\mathbf{p}_1$ and $\mathbf{p}_2$ is defined
as

$$K_q^m[\mathbf{p}_1,\mathbf{p}_2] = -\sum_{w} p_1(w)\sum_{i} p_1(e_i|w)\,\ln_q\frac{p_2(e_i|w)}{p_1(e_i|w)}. \tag{29}$$

Using the above, the symmetrized L-divergence in the Tsallis-Markovian
framework can thus be obtained as

$$L_q^m[\mathbf{p}_1,\mathbf{p}_2] = -\sum_{w}\sum_{i} p_1(w)\,p_1(e_i|w)\,\ln_q\frac{\tfrac{1}{2}p_1(e_i|w)+\tfrac{1}{2}p_2(e_i|w)}{p_1(e_i|w)} - \sum_{w}\sum_{i} p_2(w)\,p_2(e_i|w)\,\ln_q\frac{\tfrac{1}{2}p_1(e_i|w)+\tfrac{1}{2}p_2(e_i|w)}{p_2(e_i|w)}. \tag{30}$$

Thus, we get,

$$L_q^m = -\sum_{w}\sum_{i}\frac{1}{1-q}\left\{p_1(w)\,p_1(e_i|w)\left[\frac{\left(\tfrac{1}{2}p_1(e_i|w)+\tfrac{1}{2}p_2(e_i|w)\right)^{1-q}}{\left(p_1(e_i|w)\right)^{1-q}} - 1\right] + p_2(w)\,p_2(e_i|w)\left[\frac{\left(\tfrac{1}{2}p_1(e_i|w)+\tfrac{1}{2}p_2(e_i|w)\right)^{1-q}}{\left(p_2(e_i|w)\right)^{1-q}} - 1\right]\right\}. \tag{31}$$

Rearranging,

$$L_q^m = -\frac{1}{1-q}\sum_{w}\sum_{i}\left[\left(p_1(w)[p_1(e_i|w)]^q + p_2(w)[p_2(e_i|w)]^q\right)\left(\tfrac{1}{2}p_1(e_i|w)+\tfrac{1}{2}p_2(e_i|w)\right)^{1-q} - p_1(w,e_i) - p_2(w,e_i)\right]. \tag{32}$$

Therefore,

$$\begin{aligned}
L_q^m ={}& -\sum_{w}\sum_{i}\Big[\left(p_1(w)[p_1(e_i|w)]^q + p_2(w)[p_2(e_i|w)]^q\right)\ln_q\!\left(\tfrac{1}{2}p_1(e_i|w)+\tfrac{1}{2}p_2(e_i|w)\right) \\
&\quad + \frac{1}{1-q}\left(p_1(w)[p_1(e_i|w)]^q + p_2(w)[p_2(e_i|w)]^q - p_1(w)p_1(e_i|w) - p_2(w)p_2(e_i|w)\right)\Big] \\
={}& -\sum_{w}\sum_{i}\Big[\left(p_1(w)[p_1(e_i|w)]^q + p_2(w)[p_2(e_i|w)]^q\right)\ln_q\!\left(\tfrac{1}{2}p_1(e_i|w)+\tfrac{1}{2}p_2(e_i|w)\right) \\
&\quad - p_1(w)[p_1(e_i|w)]^q\ln_q p_1(e_i|w) - p_2(w)[p_2(e_i|w)]^q\ln_q p_2(e_i|w)\Big].
\end{aligned} \tag{33}$$

The Tsallis-Markovian generalization for equal weights for the
two distributions $\mathbf{p}_1$ and $\mathbf{p}_2$ ($\pi_1 = 0.5$, $\pi_2 = 0.5$) could thus be
expressed as

$$\begin{aligned}
D_q^m[\mathbf{p}_1,\mathbf{p}_2] ={}& -\sum_{w}\sum_{i}\Big[\left(\tfrac{1}{2}p_1(w)[p_1(e_i|w)]^q + \tfrac{1}{2}p_2(w)[p_2(e_i|w)]^q\right)\ln_q\!\left(\tfrac{1}{2}p_1(e_i|w)+\tfrac{1}{2}p_2(e_i|w)\right) \\
&\quad - \tfrac{1}{2}p_1(w)[p_1(e_i|w)]^q\ln_q p_1(e_i|w) - \tfrac{1}{2}p_2(w)[p_2(e_i|w)]^q\ln_q p_2(e_i|w)\Big].
\end{aligned} \tag{34}$$

The generalization to any weights $\pi_1$ and $\pi_2$ (from
$\pi_1 = \tfrac{1}{2}$, $\pi_2 = \tfrac{1}{2}$) associated with the joint distributions $p_1(w,e)$
and $p_2(w,e)$ respectively is straightforward:

$$\begin{aligned}
D_q^m[\mathbf{p}_1,\mathbf{p}_2] ={}& -\sum_{w}\sum_{i}\Big[\left(\pi_1 p_1(w)[p_1(e_i|w)]^q + \pi_2 p_2(w)[p_2(e_i|w)]^q\right)\ln_q\!\left(\frac{\pi_1 p_1(w)}{\pi_1 p_1(w)+\pi_2 p_2(w)}\,p_1(e_i|w) + \frac{\pi_2 p_2(w)}{\pi_1 p_1(w)+\pi_2 p_2(w)}\,p_2(e_i|w)\right) \\
&\quad - \pi_1 p_1(w)[p_1(e_i|w)]^q\ln_q p_1(e_i|w) - \pi_2 p_2(w)[p_2(e_i|w)]^q\ln_q p_2(e_i|w)\Big].
\end{aligned} \tag{35}$$

Note that the above generalization does not take an entropic form
or admit replacement of BGSE with non-extensive conditional
entropy in Eqn. 8 or 11 (interpretation IA); however, it can be
interpreted as mutual information (interpretation IB), as
demonstrated below.

Beginning with the conditional mutual information,

$$I_q^m(E;S|W) = -\sum_{w}\sum_{i}\sum_{j} p(w,e_i,S_j)\,\ln_q\frac{p(e_i|w)\,\pi(S_j|w)}{p(e_i,S_j|w)}, \tag{36}$$

we identify, as in the $q = 1$ cases (Eqns. 15 and 28), that $D_q^m = I_q^m$.

If the conditional probabilities $p(e_i|w)$ and $\pi(S_j|w)$ are independent,
then

$$p(e_i,S_j|w) = p(e_i|w)\,\pi(S_j|w), \tag{37}$$

and in this situation $I_q^m(E;S|W) = 0$, so that the conditional
mutual information is a measure of the independence of the
conditional probabilities.

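By means of the $\ln_q$ definition, Eqn. (36) can be rewritten as

$$I_q^m(E;S|W) = -\sum_{w}\sum_{i}\sum_{j}\frac{p(w,e_i,S_j)}{1-q}\left[\left(\frac{p(e_i|w)\,\pi(S_j|w)}{p(e_i,S_j|w)}\right)^{1-q} - 1\right]. \tag{38}$$

By means of Bayes' theorem,

$$\pi(S_j|w) = \frac{p(w|S_j)\,\pi(S_j)}{p(w)} = \frac{p(w,S_j)}{p(w)}.$$

We may rewrite,

$$I_q^m(E;S|W) = -\sum_{w}\sum_{i}\sum_{j}\frac{p(w,e_i,S_j)}{1-q}\left[\left(\frac{p(e_i|w)\,p(S_j,w)}{p(w,e_i,S_j)}\right)^{1-q} - 1\right] \tag{39}$$

$$\begin{aligned}
={}& -\sum_{w}\sum_{i}\sum_{j}\frac{[p(w,e_i,S_j)]^q}{1-q}\left\{[p(e_i|w)\,p(S_j,w)]^{1-q} - 1 + 1 - [p(w,e_i,S_j)]^{1-q}\right\} \\
={}& -\sum_{w}\sum_{i}\sum_{j}\frac{p(w,S_j)\,[p(e_i|w,S_j)]^q}{1-q}\left\{[p(e_i|w)]^{1-q} - 1 + 1 - [p(e_i|w,S_j)]^{1-q}\right\}.
\end{aligned} \tag{40}$$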
And, therefore, the generalization can be obtained as,

$$D_q^m = I_q^m(E;S|W) = -\sum_{w}\sum_{i}\sum_{j} p(w,S_j)\,[p(e_i|w,S_j)]^q\left[\ln_q p(e_i|w) - \ln_q p(e_i|w,S_j)\right]. \tag{41}$$

Notice that for model order 0, Eqn. 41 reduces to Lamberti and
Majtey's non-extensive generalization [6] (Eqn. 21), while in the
limit $q \to 1$, we recover Thakur et al.'s Markovian generalization
[7]. Note that $I_q^m[S_1,S_2] = D_q^m[\mathbf{p}_1,\mathbf{p}_2]$ (Eqn. 35) and therefore, the
Tsallis-Markovian generalization of JSD has its interpretation in
mutual information.
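The Tsallis-Markovian measure can be sketched by combining the deformed logarithm with context-conditioned probabilities, following the structure of Eqn. 35. As before, probabilities are estimated from overlapping $(m+1)$-mer counts, weights are proportional to the number of counted positions, and $q > 0$ is assumed; these choices and all names are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def ln_q(x, q):
    """Deformed logarithm; natural log at q = 1."""
    if q == 1:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def context_model(seq, m):
    """Return joint p(w, e), marginal p(w), and the number of counted positions."""
    counts = Counter((seq[i - m:i], seq[i]) for i in range(m, len(seq)))
    total = sum(counts.values())
    joint = {key: c / total for key, c in counts.items()}
    ctx = defaultdict(float)
    for (w, _), p in joint.items():
        ctx[w] += p
    return joint, ctx, total

def tsallis_markov_jsd(s1, s2, m, q):
    """Non-extensive Markovian JSD between two sequences (structure of Eqn. 35)."""
    j1, c1, n1 = context_model(s1, m)
    j2, c2, n2 = context_model(s2, m)
    pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)
    parts = ((pi1, j1, c1), (pi2, j2, c2))
    mix_joint, mix_ctx = defaultdict(float), defaultdict(float)
    for pi, joint, ctx in parts:
        for key, p in joint.items():
            mix_joint[key] += pi * p
        for w, p in ctx.items():
            mix_ctx[w] += pi * p
    total = 0.0
    for pi, joint, ctx in parts:
        for (w, e), p in joint.items():
            cond = p / ctx[w]                           # p_j(e|w)
            cond_mix = mix_joint[(w, e)] / mix_ctx[w]   # mixture p(e|w)
            total -= pi * ctx[w] * cond ** q * (ln_q(cond_mix, q) - ln_q(cond, q))
    return total

# Toy sequences with identical composition but different ordering (assumed data).
s1, s2 = "ACGT" * 30, "AGCT" * 30
markov_limit = tsallis_markov_jsd(s1, s2, 1, 1)    # q = 1: Markovian JSD
nonextensive = tsallis_markov_jsd(s1, s2, 1, 2.0)  # q = 2, order 1
```

Setting `q = 1` recovers the Markovian case and setting `m = 0` recovers the non-extensive i.i.d. case, mirroring the two special cases named in the text.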

Experiments and Assessment 

To assess the discriminative abilities of JSD and its generalized
forms, we compiled a test set of chimeric sequence constructs by
concatenating DNA sequences from phylogenetically distinct
organisms. Let $S$ be a sequence composed of symbols from an
alphabet of $k$ symbols $e_i$ ($i = 1, \ldots, k$). Let us further assume that
sequence $S$ is the concatenation of two subsequences $S_1$ and $S_2$. Let
$p_{S_j}(e_i)$ denote the probability of symbol $e_i$ in subsequence $S_j$, and
$p(S_j)$, or simply $\pi_j$, the weight associated with the distribution $\mathbf{p}_j$
($j = 1, 2$). Since the actual probability $p_{S_j}(e_i)$ is often not known, the
relative frequency of symbol $e_i$ in subsequence $S_j$, $f_{S_j}(e_i)$, is used as
the estimate of $p_{S_j}(e_i)$. Thus, $D_1[\mathbf{p}_1,\mathbf{p}_2]$ or its generalizations for
given subsequences $S_1$ and $S_2$ is, in effect, a measure of the
difference between the estimates of $\mathbf{p}_1$ and $\mathbf{p}_2$. We use weights $\pi_j$
proportional to the length of $S_j$, which was earlier found to be most
appropriate for symbolic sequence analysis [9].

Chimeric sequence constructs were obtained by concatenating
two equal size sequence segments selected randomly from the
genomes of two different organisms. We chose four phylogenetically
distinct organisms: Escherichia coli, Salmonella enterica, Yersinia
pestis and Haemophilus influenzae. The first three belong to the family
Enterobacteriaceae and the fourth is an outgroup belonging to the
family Pasteurellaceae. We obtained the sequence constructs of
20 Kbp by concatenating a 10 Kbp genomic segment from E. coli
with a 10 Kbp segment from one of the other three organisms. The
phylogenetic proximity of these organisms to E. coli is in the
following order: S. enterica > Y. pestis > H. influenzae. We subjected
the non-extensive MJSD to detecting the join point of the two
disparate sequence segments. A cursor was moved along the
chimeric sequence construct and the non-extensive MJSD was
computed for the sequence segments left and right of the cursor. The
position where the non-extensive MJSD was maximized was noted.
The error in localizing the join point was obtained as the absolute
difference between the position where the non-extensive MJSD
was maximized and the position of the join point in a sequence
construct (for sequence constructs of 20 Kbp, the maximum and
minimum possible error would thus be 10,000 bp and 0 bp
respectively).
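The cursor-based procedure described above can be sketched as follows. For brevity this example swaps in the standard JSD (q = 1, order 0) with length-proportional weights instead of the non-extensive MJSD, and it uses a small synthetic string rather than genomic data; the function names and the `min_len` parameter are hypothetical.

```python
import math
from collections import Counter

def entropy(seq):
    """BGSE of the relative symbol frequencies of seq."""
    n = len(seq)
    return -sum((c / n) * math.log(c / n) for c in Counter(seq).values())

def jsd_at_cut(seq, cut):
    """Standard JSD between seq[:cut] and seq[cut:], weights proportional to length."""
    left, right = seq[:cut], seq[cut:]
    n = len(seq)
    return (entropy(seq)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

def find_join_point(seq, min_len=1):
    """Move a cursor along seq and return the cut position maximizing the divergence.
    Recomputing counts at every cut is quadratic in length; fine for a small demo."""
    cuts = range(min_len, len(seq) - min_len + 1)
    return max(cuts, key=lambda cut: jsd_at_cut(seq, cut))

# Synthetic chimeric construct (assumed toy data): join point at position 100.
chimera = "AACC" * 25 + "GGTT" * 25
true_join = 100
estimated = find_join_point(chimera)
error = abs(estimated - true_join)
```

In this toy construct the two halves use disjoint symbols, so the divergence peaks exactly at the join; real genomic segments overlap heavily in composition, which is why the errors reported below are far from zero.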

For experiments with 10,000 replicates each for E. coli ⊕ S.
enterica, E. coli ⊕ Y. pestis, and E. coli ⊕ H. influenzae (⊕ denotes
concatenation), the mean errors in detecting the join point for
standard JSD (q = 1, order 0) were 4072, 3400 and 589 bp
respectively, consistent with the order of divergence of E. coli from
the other three organisms, with H. influenzae being the outgroup
(Figure 1). For the non-extensive generalization (q varies, order 0;
error statistics shown within three rectangular boxes with dashed
red borders in Figure 1), the minimum mean errors (in the same
order of divergence from E. coli) were observed to be 4053, 3381
and 588 bp for q in the range 1.5-2.0. Since H. influenzae is
phylogenetically distant from E. coli, the generalization induces
very minor improvement, while for the others, all belonging to the
same family, the generalization induces more improvement,
apparently because there is more room for improvement in these cases.
In contrast, for the Markovian generalization (q = 1, order varies;
error statistics shown within the rectangular box with dashed green
borders in Figure 1), the improvements were substantially more
pronounced, with corresponding minimum mean errors being
2949, 1959 and 271 bp at order 2, 3 and 3 respectively. This large
improvement is apparently due to the Markovian generalization
accounting for short-range correlations in the nucleotide ordering
in genomic sequences, which is not considered in the
non-extensive generalization. As expected from the above results,
the non-extensive Markovian generalization induces further
improvement over the Markovian generalization, generating the
respective minimum mean errors of 2907, 1788 and 271 bp at
different combinations of q and model order (shown encircled and
bold faced in Figure 1). Clearly, the non-extensive generalization
reaches saturation in improvement at large phylogenetic distances
between the organisms under comparison, while it induces
significant improvements for phylogenetically proximal organisms.
Indeed, the reduction of more than 40 bp in error for E. coli ⊕ S.
enterica and 170 bp for E. coli ⊕ Y. pestis is remarkable considering
that these organisms are phylogenetically very close and therefore
difficult to differentiate in their genomic composition [13]. The
higher values of standard deviation from the mean are likely
because of the non-homogeneity of the bacterial genomes. A
significant portion (~1-20%) of bacterial DNAs is mobile and
therefore distinct from the ancestral DNAs acquired through the
reproductive processes [23]. The mean values of non-extensive
MJSD at each position of the chimeric sequence constructs E. coli
⊕ Y. pestis and the frequency distribution of position with
maximum value of non-extensive MJSD for these sequence
constructs are shown in Figure 2 and Figure 3 respectively, for
the parameter setting at which the non-extensive MJSD achieved
most pronounced error reduction (q = 2, order 3). Notably, the
value of MJSD increases monotonically with increase in q or
model order or both (Figure 2). A sharp spike in the distribution
around position 10 Kbp demonstrates the efficiency of the
divergence measure in localizing the join point of E. coli and Y.
pestis sequences (Figure 3), with the best performance at the q = 2 and
order 3 setting (Figure 1). We show in Figures S1-S15 these data
for all three kinds of sequence construct and at all parameter
settings.

In Figure S16, we show the error statistics for cases when the
chimeric sequence constructs of 20 Kbp had 5 Kbp from a non-E.
coli organism (S. enterica, Y. pestis or H. influenzae) and the remaining
15 Kbp from E. coli. The variable length taxonomically distinct
sequences within chimeric constructs present significantly more
challenge for the statistical methods than the chimeric constructs
with similar size sequences. As expected, the mean errors in
detecting the join point increased in all cases. The Markovian
generalization still results in much better performance than the
non-extensive generalization, while the non-extensive Markovian
generalization led to a more pronounced improvement for E. coli
⊕ Y. pestis (a reduction of 295 bp in mean error compared with
the Markovian generalization). Non-extensive generalization of
MJSD did not induce further improvement for E. coli ⊕ S. enterica,
likely because of the weakened discriminatory signal as a
consequence of reduction in the size of S. enterica fragments.
Figures S17-S31 provide plots for divergence values at each
sequence position as well as frequency distributions of position
with maximum divergence for all three kinds of sequence construct
and at all parameter settings. The discrimination of DNA
sequences from phylogenetically close relatives such as E. coli
and S. enterica is difficult, yet this study shows that there is still
room for improvement with the development of more flexible,
sensitive methods. Overall, the non-extensive Markovian
generalization results in improved efficiency in discriminating sequences
from phylogenetically proximal organisms.

Conclusions 

The proposed generalization of JSD in the integrated
framework of Tsallis and Markovian statistics provides a powerful tool
for symbolic sequence analysis. In application to deconstructing
the chimeric bacterial sequences, the Tsallis-Markovian
generalization achieved remarkable improvement over both the Tsallis
and the Markovian generalizations. The superior
performance of Tsallis-Markovian JSD was most pronounced when the
sequences under comparison arose from phylogenetically proximal
organisms. E. coli, S. enterica and Y. pestis all belong to the same
Enterobacteriaceae family; previous studies have shown the limitations
of JSD in distinguishing sequences from organisms belonging to
the same family [13]. Therefore, the improvement achieved by the
proposed generalized measure is an important step forward in
interpreting biological data, which often have heterogeneities at
varying levels. While for the first time, to the best of our
knowledge, the theoretically distinct generalizations of JSD
accomplished by different research groups have been brought to
one place for comparison and assessment, this study has also
bridged the gaps in the field by obtaining generalizations
consistent with the original proposal for JSD derivation and by
providing the interpretations in the framework of statistical
physics, information theory and mathematical statistics, where
possible. The proposed divergence measure, generalized in the
integrated framework of Tsallis and Markovian statistics, provides
a new exploratory tool, augmented in both power and flexibility,
to mine the symbolic sequence data.

Supporting information 

Figure S1 Mean values of non-extensive MJSD at each position 
of the chimeric sequence constructs E. coli ⊕ S. enterica, for model 
order m = 0-3. For each model order, plots are shown for different 
values of Tsallis statistics' parameter q, in the range 0.5-3. The 
chimeric constructs of size 20 Kbp are comprised of two equal 
sized sequences, with each component sequence of length 10 Kbp 
obtained from the genome of each organism. 
(TIF) 

Figure S2 Mean values of non-extensive MJSD at each position 
of the chimeric sequence constructs E. coli ⊕ S. enterica, for Tsallis 
statistics' parameter q = 0.5, 0.7, 1.0, 1.5. For each q, plots are 
shown for different model orders, in the range 0-3. The chimeric 
constructs of size 20 Kbp are comprised of two equal sized 
sequences, with each component sequence of length 10 Kbp 
obtained from the genome of each organism. 
(TIF) 

Figure S3 As in Figure S2, but for Tsallis statistics' parameter 
q = 2.0, 2.5, 3.0. 
(TIF) 

Figure S4 Mean values of non-extensive MJSD at each position 
of the chimeric sequence constructs E. coli ⊕ Y. pestis, for model 
order m = 0-3. For each model order, plots are shown for different 
values of Tsallis statistics' parameter q, in the range 0.5-3. The 
chimeric constructs of size 20 Kbp are comprised of two equal 
sized sequences, with each component sequence of length 10 Kbp 
obtained from the genome of each organism. 
(TIF) 

Figure S5 Mean values of non-extensive MJSD at each position 
of the chimeric sequence constructs E. coli ⊕ Y. pestis, for Tsallis 
statistics' parameter q = 0.5, 0.7, 1.0, 1.5. For each q, plots are 
shown for different model orders, in the range 0-3. The chimeric 
constructs of size 20 Kbp are comprised of two equal sized 
sequences, with each component sequence of length 10 Kbp 
obtained from the genome of each organism. 
(TIF) 

Figure S6 As in Figure S5, but for Tsallis statistics' parameter 
q = 2.0, 2.5, 3.0. 
(TIF) 

Figure S7 Mean values of non-extensive MJSD at each position 
of the chimeric sequence constructs E. coli ⊕ H. influenzae, for 
model order m = 0-3. For each model order, plots are shown for 
different values of Tsallis statistics' parameter q, in the range 0.5-3. 
The chimeric constructs of size 20 Kbp are comprised of two 
equal sized sequences, with each component sequence of length 
10 Kbp obtained from the genome of each organism. 
(TIF) 

Figure S8 Mean values of non-extensive MJSD at each position 
of the chimeric sequence constructs E. coli ⊕ H. influenzae, for 
Tsallis statistics' parameter q = 0.5, 0.7, 1.0, 1.5. For each q, plots 
are shown for different model orders, in the range 0-3. The 
chimeric constructs of size 20 Kbp are comprised of two equal 
sized sequences, with each component sequence of length 10 Kbp 
obtained from the genome of each organism. 
(TIF) 

Figure S9 As in Figure S8, but for Tsallis statistics' parameter 
q = 2.0, 2.5, 3.0. 
(TIF) 

Figure S10 Frequency distribution of position with maximum 
value of non-extensive MJSD for the chimeric sequence constructs 
E. coli ⊕ S. enterica, for model order m = 0 (A, B) and 1 (C, D). For 
each model order, distributions are shown for different values of 
Tsallis statistics' parameter q, in the range 0.5-3. The chimeric 
constructs of size 20 Kbp are comprised of two equal sized 
sequences, with each component sequence of length 10 Kbp 
obtained from the genome of each organism. 
(TIF) 

Figure S11 As in Figure S10, but for model order m = 2 (E, F) 
and 3 (G, H). 
(TIF) 

Figure S12 Frequency distribution of position with maximum 
value of non-extensive MJSD for the chimeric sequence constructs 
E. coli ⊕ Y. pestis, for model order m = 0 (A, B) and 1 (C, D). For 
each model order, distributions are shown for different values of 
Tsallis statistics' parameter q, in the range 0.5-3. The chimeric 
constructs of size 20 Kbp are comprised of two equal sized 
sequences, with each component sequence of length 10 Kbp 
obtained from the genome of each organism. 
(TIF) 

Figure S13 As in Figure S12, but for model order m = 2 (E, F) 
and 3 (G, H). 
(TIF) 

Figure S14 Frequency distribution of position with maximum 
value of non-extensive MJSD for the chimeric sequence constructs 
E. coli ⊕ H. influenzae, for model order m = 0 (A, B) and 1 (C, D). 
For each model order, distributions are shown for different values 
of Tsallis statistics' parameter q, in the range 0.5-3. The chimeric 
constructs of size 20 Kbp are comprised of two equal sized 
sequences, with each component sequence of length 10 Kbp 
obtained from the genome of each organism. 
(TIF) 

Figure S15 As in Figure S14, but for model order m = 2 (E, F) 
and 3 (G, H). 
(TIF) 

Figure S16 Error (in base pairs) in detecting the join point in the 
chimeric sequence constructs for E. coli ⊕ S. enterica, E. coli ⊕ Y. 
pestis, and E. coli ⊕ H. influenzae (⊕ denotes concatenation). The 
proposed Tsallis-Markovian generalization of the Jensen-Shannon 
divergence measure was used to obtain the mean and standard 
deviation of the error from 5,000 replicates for each type of 
chimeric sequence construct. The error in localizing the join 
point was obtained as the absolute difference between the position 
where the divergence was maximized and the position of the join 
point (at 5 Kbp) in a chimeric sequence construct of size 20 Kbp 
(5 Kbp sequence from the non-E. coli organism concatenated with 
15 Kbp from E. coli). Error statistics for the two special cases of the 
proposed generalized measure are shown within rectangular boxes: 
the Markovian generalization (q = 1) in dashed green border boxes 
and the Tsallis non-extensive generalization (model order = 0) in 
dashed red border boxes. The minimum values of the mean and 
standard deviation of the error for each chimeric construct type 
are shown encircled and in bold face. 
(TIFF) 

Figure S17 Mean values of non-extensive MJSD at each 
position of the chimeric sequence constructs E. coli ⊕ S. enterica, 
for model order m = 0-3. For each model order, plots are shown 
for different values of Tsallis statistics' parameter q, in the range 
0.5-3. The chimeric constructs of size 20 Kbp are comprised of 
two sequences, one component sequence of length 5 Kbp obtained 
from the genome of S. enterica and the other of length 15 Kbp from 
the genome of E. coli. 
(TIF) 

Figure S18 Mean values of non-extensive MJSD at each 
position of the chimeric sequence constructs E. coli ⊕ S. enterica, 
for Tsallis statistics' parameter q = 0.5, 0.7, 1.0, 1.5. For each q, 
plots are shown for different model orders, in the range 0-3. The 
chimeric constructs of size 20 Kbp are comprised of two 
sequences, one component sequence of length 5 Kbp obtained 
from the genome of S. enterica and the other of length 15 Kbp from 
the genome of E. coli. 
(TIF) 

Figure S19 As in Figure S18, but for Tsallis statistics' parameter 
q = 2.0, 2.5, 3.0. 
(TIF) 

Figure S20 Mean values of non-extensive MJSD at each 
position of the chimeric sequence constructs E. coli ⊕ Y. pestis, 
for model order m = 0-3. For each model order, plots are shown 
for different values of Tsallis statistics' parameter q, in the range 
0.5-3. The chimeric constructs of size 20 Kbp are comprised of 
two sequences, one component sequence of length 5 Kbp obtained 
from the genome of Y. pestis and the other of length 15 Kbp from 
the genome of E. coli. 
(TIF) 



Figure S21 Mean values of non-extensive MJSD at each 
position of the chimeric sequence constructs E. coli ⊕ Y. pestis, 
for Tsallis statistics' parameter q = 0.5, 0.7, 1.0, 1.5. For each q, 
plots are shown for different model orders, in the range 0-3. The 
chimeric constructs of size 20 Kbp are comprised of two 
sequences, one component sequence of length 5 Kbp obtained 
from the genome of Y. pestis and the other of length 15 Kbp from 
the genome of E. coli. 
(TIF) 

Figure S22 As in Figure S21, but for Tsallis statistics' parameter 
q = 2.0, 2.5, 3.0. 
(TIF) 

Figure S23 Mean values of non-extensive MJSD at each 
position of the chimeric sequence constructs E. coli ⊕ H. influenzae, 
for model order m = 0-3. For each model order, plots are shown 
for different values of Tsallis statistics' parameter q, in the range 
0.5-3. The chimeric constructs of size 20 Kbp are comprised of 
two sequences, one component sequence of length 5 Kbp obtained 
from the genome of H. influenzae and the other of length 15 Kbp 
from the genome of E. coli. 
(TIF) 

Figure S24 Mean values of non-extensive MJSD at each 
position of the chimeric sequence constructs E. coli ⊕ H. influenzae, 
for Tsallis statistics' parameter q = 0.5, 0.7, 1.0, 1.5. For each q, 
plots are shown for different model orders, in the range 0-3. The 
chimeric constructs of size 20 Kbp are comprised of two 
sequences, one component sequence of length 5 Kbp obtained 
from the genome of H. influenzae and the other of length 15 Kbp 
from the genome of E. coli. 
(TIF) 

Figure S25 As in Figure S24, but for Tsallis statistics' parameter 
q = 2.0, 2.5, 3.0. 
(TIF) 

Figure S26 Frequency distribution of position with maximum 
value of non-extensive MJSD for the chimeric sequence constructs 
E. coli ⊕ S. enterica, for model order m = 0 (A, B) and 1 (C, D). For 
each model order, distributions are shown for different values of 
Tsallis statistics' parameter q, in the range 0.5-3. The chimeric 
constructs of size 20 Kbp are comprised of two sequences, one 
component sequence of length 5 Kbp obtained from the genome of 
S. enterica and the other of length 15 Kbp from the genome of E. coli. 
(TIF) 

Figure S27 As in Figure S26, but for model order m = 2 (E, F) 
and 3 (G, H). 
(TIF) 

Figure S28 Frequency distribution of position with maximum 
value of non-extensive MJSD for the chimeric sequence constructs 
E. coli ⊕ Y. pestis, for model order m = 0 (A, B) and 1 (C, D). For 
each model order, distributions are shown for different values of 
Tsallis statistics' parameter q, in the range 0.5-3. The chimeric 
constructs of size 20 Kbp are comprised of two sequences, one 
component sequence of length 5 Kbp obtained from the genome 
of Y. pestis and the other of length 15 Kbp from the genome of E. 
coli. 
(TIF) 

Figure S29 As in Figure S28, but for model order m = 2 (E, F) 
and 3 (G, H). 
(TIF) 

Figure S30 Frequency distribution of position with maximum 
value of non-extensive MJSD for the chimeric sequence constructs 
E. coli ⊕ H. influenzae, for model order m = 0 (A, B) and 1 (C, D). 
For each model order, distributions are shown for different values 
of Tsallis statistics' parameter q, in the range 0.5-3. The chimeric 
constructs of size 20 Kbp are comprised of two sequences, one 
component sequence of length 5 Kbp obtained from the genome 
of H. influenzae and the other of length 15 Kbp from the genome of 
E. coli. 
(TIF) 

Figure S31 As in Figure S30, but for model order m = 2 (E, F) 
and 3 (G, H). 
(TIF) 